Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams

Authors

  • Tamraparni Dasu
  • Shankar Krishnan
  • Dongyu Lin
  • Suresh Venkatasubramanian
  • Kevin Yi

Abstract

Data streams are dynamic, with frequent distributional changes. In this paper, we propose a statistical approach to detecting distributional shifts in multi-dimensional data streams. We use relative entropy, also known as the Kullback-Leibler distance, to measure the statistical distance between two distributions. In the context of a multidimensional data stream, the distributions are generated by data from two sliding windows. We maintain a sample of the data from the stream inside the windows to build the distributions. Our algorithm is streaming, nonparametric, and requires no distributional or model assumptions. It employs the statistical theory of hypothesis testing and bootstrapping to determine whether the distributions are statistically different. We provide a full suite of experiments on synthetic data to validate the method and demonstrate its effectiveness on data from real-life applications.
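The abstract's pipeline (two windows, a relative-entropy distance, and a bootstrap test) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed-width grid histogram, the additive smoothing constant, the choice to resample both bootstrap windows from the reference window, and every name below (`histogram`, `kl_distance`, `bootstrap_threshold`, `detect_change`, `bin_width`, `smoothing`, `n_boot`, `alpha`) are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def histogram(window, bin_width):
    """Count points per grid cell; each point is a tuple of floats.
    A fixed-width grid is an assumption of this sketch."""
    return Counter(tuple(math.floor(x / bin_width) for x in p) for p in window)


def kl_distance(ref, cur, bin_width=0.5, smoothing=0.5):
    """Relative entropy D(P || Q) between smoothed histograms of two windows."""
    p_counts = histogram(ref, bin_width)
    q_counts = histogram(cur, bin_width)
    cells = set(p_counts) | set(q_counts)
    # Additive smoothing keeps every cell probability strictly positive.
    p_total = len(ref) + smoothing * len(cells)
    q_total = len(cur) + smoothing * len(cells)
    d = 0.0
    for c in cells:
        p = (p_counts[c] + smoothing) / p_total
        q = (q_counts[c] + smoothing) / q_total
        d += p * math.log(p / q)
    return d


def bootstrap_threshold(ref, cur_size, n_boot=200, alpha=0.05, rng=None):
    """Approximate the (1 - alpha) quantile of the distance under 'no change'
    by drawing both windows, with replacement, from the reference window."""
    rng = rng or random.Random(0)
    stats = []
    for _ in range(n_boot):
        a = rng.choices(ref, k=len(ref))
        b = rng.choices(ref, k=cur_size)
        stats.append(kl_distance(a, b))
    stats.sort()
    return stats[min(len(stats) - 1, int((1 - alpha) * len(stats)))]


def detect_change(ref, cur, n_boot=200, alpha=0.05, rng=None):
    """Report a change when the observed distance exceeds the bootstrap threshold."""
    threshold = bootstrap_threshold(ref, len(cur), n_boot, alpha, rng)
    return kl_distance(ref, cur) > threshold
```

In a streaming setting, `ref` and `cur` would hold the samples maintained for the two sliding windows, and `detect_change` would be re-evaluated as the current window advances. The sketch recomputes everything from scratch on each call, so it shows the statistical test only, not an efficient streaming update.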

Publication date: 2009